Part 1 Semiconductor manufacturing process

Import and explore the data

Data cleansing, Data analysis & visualisation

Data pre-processing, Model training, testing and tuning

Classification on Original Data

Lucifer ML applies SMOTE (oversampling) before hyperparameter tuning.

CatBoost Classifier has highest

Cross validation techniques

Train-Test Split

K-fold

Repeated k-fold

Leave-one-out

Stratified k-fold
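The splitting strategies listed above can be illustrated with a minimal from-scratch sketch. It is not the notebook's own code (which presumably relied on a library implementation); the function names `k_fold_indices` and `stratified_k_fold_indices` are hypothetical and chosen only for illustration. Leave-one-out is the special case where `k` equals the number of samples, and repeated k-fold amounts to calling the same function with different seeds.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the samples so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def stratified_k_fold_indices(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions similar in every fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal the members of each class round-robin across the folds.
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx
```

Stratification matters most when one class is rare, because a plain k-fold split can leave a fold with very few minority samples.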

SMOTE to upsample smaller class
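As a rough illustration of the idea behind SMOTE, the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified, from-scratch version under stated assumptions (numeric features, Euclidean distance), not the library routine the notebook presumably used; `smote_sketch` and its parameters are hypothetical names.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # The k nearest minority neighbours of the base point, excluding the point itself.
        others = [p for i, p in enumerate(minority) if i != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic
```

Unlike plain duplication, each new point lies somewhere on the segment between two existing minority samples, so the synthetic rows are not exact copies.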

UpSample smaller class

Random Oversampling
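Random oversampling can be sketched in a few lines: rows of the smaller classes are duplicated at random, with replacement, until every class matches the largest one. This is a minimal illustration rather than the notebook's implementation, and `random_oversample` is a hypothetical helper name.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels
```

Because the added rows are exact copies, resampling is usually applied to the training split only; copies that end up on both sides of a split can inflate the test score.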

Undersampling Using ClusterCentroids

PCA_Original Data

PCA_Oversampling Data

PCA_Undersampling Data

Pipeline

PCA_Pipeline

PCA_Oversampling Data_Pipeline

PCA_Undersampling Data_Pipeline

The classification algorithms gave the best accuracy score on the Random Oversampling data.

1. Classification score on Original data: 0.92
2. Classification score on Random Over Sampling data: 1
3. Classification score on Under Sampling data: 0.81
4. Classification score on PCA Original data: 0.93
5. Classification score on PCA Random Over Sampling data: 0.93
6. Classification score on PCA Under Sampling data: 0.91
7. Classification score on Pipeline data: 0.95
8. Classification score on PCA Pipeline data: 0.94
9. Classification score on PCA Pipeline Random Over Sampling data: 0.69
10. Classification score on PCA Pipeline Under Sampling data: 0.85

Now let's check the score generated by H2O AutoML

Gradient Boosting Machine has the highest accuracy on the sensor data.

Future Data

Binary Logistic Regression got the highest raw score on the future oversampled synthetic data.

The future data was generated using random oversampling. The synthetic data was then generated from the oversampled data using SDV. Binary logistic regression scored 60% on the synthetic oversampled data.

Conclusion and improvements

The bivariate graphs show that sensors stopped working because of some technical issue: the data recorded during that period was zero. Random Oversampling and SMOTE were the best resampling techniques, giving the highest accuracy scores. Among the models compared during hyperparameter tuning, Gradient Boosting and Support Vector Machine had the highest accuracy scores. PCA dimensionality reduction did not improve the modelling on this data set. However, the components that were analysed still scored well despite the reduction in data.

Multiple rows in the data set are not fit for analysis. Although the data set is large, it was easy to train, validate and test on.